Biniam Abebe - 04/20/2024

Hands-on Assignment

Complete the following two sections on Supervised Machine Learning:

Linear and Logistic Regression

Part 1: Linear Regression

Machine Learning Supervised Linear Regression

[Figure: SL.jpg]

STEP 1: Import Libraries

WORKFLOW: DATA SET

STEP 2: Read data description and Load the Data

Description of Boston Housing Dataset

STEP 3: Give names to the columns since there are no headers
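Note: A minimal sketch of Steps 2 and 3, assuming the housing file is whitespace-delimited with no header row. The two rows in the string below are illustrative stand-ins for the real file, and the variable names (columns, raw, df) are hypothetical. The column names follow the standard Boston Housing description.

```python
import io
import pandas as pd

# Column names from the standard Boston Housing data description.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Illustrative whitespace-delimited rows standing in for the real file.
raw = io.StringIO(
    "0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0\n"
    "0.02731  0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6\n"
)

# header=None says the file has no header row; names= supplies the headers.
df = pd.read_csv(raw, sep=r"\s+", header=None, names=columns)
print(df.head())
```

With a real file, the StringIO object would be replaced by the file path.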

WORKFLOW: Clean and Preprocess the Dataset

STEP 4: Clean the data
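Note: One common way to clean a dataset is to count the missing values per column and then drop the incomplete rows. The sketch below assumes that approach; the small DataFrame is made up for illustration and is not the assignment data.

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing values, for illustration only.
df = pd.DataFrame({
    "RM":   [6.5, np.nan, 7.1, 6.0],
    "MEDV": [24.0, 21.6, np.nan, 33.4],
})

# Count missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
df_clean = df.dropna()
print(df_clean.shape)
```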

STEP 5: Performing the Exploratory Data Analysis (EDA)

STEP 5A: Create Histograms
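Note: A minimal histogram sketch using pandas' built-in plotting. The data is randomly generated as a stand-in for the housing columns, and the "Agg" backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import numpy as np
import pandas as pd

# Randomly generated stand-in columns (not the real housing data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 200),
    "AGE": rng.uniform(0, 100, 200),
})

# One histogram per numeric column.
axes = df.hist(bins=20, figsize=(8, 4))
```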

STEP 5B: Create Density Plots

STEP 5C: Create Boxplots

STEP 5D: Pair Plots - Correlation Analysis and Feature Selection

STEP 5E: Creating Heatmaps
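Note: A heatmap is usually drawn from the correlation matrix of the columns. The sketch below assumes that approach and uses plain matplotlib; the data is randomly generated, with MEDV deliberately built to depend on RM so the correlation is visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (not the real housing data).
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 200)
df = pd.DataFrame({
    "RM": rm,
    "LSTAT": rng.normal(12, 5, 200),
    "MEDV": 9.0 * rm - 34 + rng.normal(0, 3, 200),
})

# Pairwise Pearson correlation between all columns.
corr = df.corr()

# Draw the matrix as a color-coded grid.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
```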

WORKFLOW: DATA SPLIT

STEP 6: Separate the Dataset into Input & Output NumPy Arrays
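Note: A minimal sketch of separating inputs from the output, assuming the target (MEDV) is the last column. The three-row DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up sample; the last column is the value we want to predict.
df = pd.DataFrame({
    "RM":    [6.5, 6.4, 7.1],
    "LSTAT": [4.9, 9.1, 4.0],
    "MEDV":  [24.0, 21.6, 34.7],
})

array = df.to_numpy()
X = array[:, :-1]  # input: every column except the last
Y = array[:, -1]   # output: the last column only
print(X.shape, Y.shape)
```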

STEP 7: Split the Input/Output Arrays into Training/Testing Datasets
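Note: A minimal sketch of the train/test split with scikit-learn's train_test_split. The arrays, the 20% test size, and the random seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays: 20 rows, 2 input columns.
X = np.arange(40, dtype=float).reshape(20, 2)
Y = np.arange(20, dtype=float)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=7
)
print(X_train.shape, X_test.shape)
```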

WORKFLOW: TRAIN MODEL

STEP 8: Build and Train the Model
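Note: A minimal sketch of building and training a linear regression model. The data is synthetic (a known linear relationship plus noise), so the fitted coefficients can be compared against the values used to generate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: Y = 2*x1 - 1*x2 + 0.5*x3 + 3 + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)

# Build the model, then train it on the training portion only.
model = LinearRegression()
model.fit(X_train, Y_train)
print(model.intercept_, model.coef_)
```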

WORKFLOW: SCORE MODEL

STEP 9: Calculate R-Squared

Note: The higher the R-squared, the better (on a scale of 0 to 100%). Depending on the model, the best models score above 83%. The R-squared value tells us how well the independent variables predict the dependent variable; here that value is very low. Think about how you could increase the R-squared. What variables would you use?
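Note: R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below, on synthetic data, checks that this matches the value returned by the model's score method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=1.0, size=200)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)
model = LinearRegression().fit(X_train, Y_train)

# R-squared by hand: 1 - SS_res / SS_tot on the test set.
pred = model.predict(X_test)
ss_res = np.sum((Y_test - pred) ** 2)
ss_tot = np.sum((Y_test - Y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same quantity as reported by scikit-learn.
r2_sklearn = model.score(X_test, Y_test)
print(f"R-squared: {r2_sklearn * 100:.1f}%")
```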

Step 10: Prediction

Note: The target is the median value of owner-occupied homes in units of $1,000, so the model predicts a median home value of around $24,144 for the above suburb.
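Note: A minimal sketch of predicting for one new observation and converting the result from thousands of dollars to dollars. The model is trained on synthetic data and the new_suburb values are hypothetical, so the printed number will not match the assignment's result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data; the target is treated as being in units of $1,000.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 22.0 + rng.normal(scale=1.0, size=200)
model = LinearRegression().fit(X, Y)

# predict() expects a 2D array: one row per observation.
new_suburb = np.array([[0.1, -0.2, 0.3]])
pred_thousands = model.predict(new_suburb)[0]

# Convert from thousands of dollars to dollars.
pred_dollars = pred_thousands * 1000
print(f"Predicted median value: ${pred_dollars:,.0f}")
```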

WORKFLOW: EVALUATE MODELS

Step 11: Train & Score Model 2 Using K-Fold Cross Validation Data Split

Note: After we train, we evaluate. We use K-fold cross-validation to determine whether the model is acceptable. We pass the whole dataset, since scikit-learn divides it into folds for us. The average score across the folds is about -64. This is the mean squared error, which would traditionally be positive, but scikit-learn reports it as a negative value so that a higher score is always better. Taking the square root of its magnitude gives an error of around 8.
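Note: A minimal sketch of K-fold scoring with the neg_mean_squared_error scorer, on synthetic data. The number of folds and the seed are assumptions. The last lines flip the sign and take the square root, the same arithmetic that turns -64 into roughly 8.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=2.0, size=200)

# Pass the whole dataset; KFold handles the splitting.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(
    LinearRegression(), X, Y, cv=kfold, scoring="neg_mean_squared_error"
)

# Scores are negative MSE values; flip the sign to get the usual MSE.
mean_mse = -scores.mean()
# Square root of the averaged MSE, in the same units as Y.
rmse = np.sqrt(mean_mse)
print(scores.mean(), rmse)
```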

Step 12: Score Using Explained Variance

Let's use a different scoring parameter. Here we use the Explained Variance. The best possible score is 1.0; lower values are worse.
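Note: Switching the scoring parameter only requires changing the scoring string. The sketch below uses scikit-learn's explained_variance scorer on synthetic data; the fold count and seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=2.0, size=200)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
ev_scores = cross_val_score(
    LinearRegression(), X, Y, cv=kfold, scoring="explained_variance"
)

# 1.0 is the best possible score; lower values are worse.
print(ev_scores.mean())
```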

To learn more about scikit-learn scoring, see https://scikit-learn.org/stable/modules/model_evaluation.html


Part 2: Logistic Regression

Machine Learning Supervised Logistic Regression

Let's begin Part 2, logistic regression, using the same Supervised Learning Workflow used in Part 1.

STEP 1: Import Libraries

WORKFLOW: DATA SET

STEP 2: Read data description and Load the Data

Description Iris Dataset

Data Set: iris.csv

Title: Iris Plants Database (updated Sept 21 by C. Blake; added discrepancy information)

Sources:

Relevant Information: This is perhaps the best-known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.)

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Predicted attribute: class of Iris plant

Number of Instances: 150 (50 in each of three classes)

Number of predictors: 4 numeric

Predictive attributes and the class attribute information:

class:

[Figure: flower.jpg]
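Note: The counts in the description above can be checked in code. This sketch assumes scikit-learn's bundled copy of the Iris data as a stand-in for iris.csv, so the column names shown are scikit-learn's, not necessarily those of the assignment file.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data as a pandas DataFrame.
iris = load_iris(as_frame=True)
frame = iris.frame  # 4 numeric predictors plus a "target" column

print(frame.shape)
# Number of instances in each of the three classes.
class_counts = frame["target"].value_counts()
print(class_counts)
print(list(iris.target_names))
```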

WORKFLOW: Clean and Preprocess the Dataset

STEP 3: Clean the data

STEP 4: Performing the Exploratory Data Analysis (EDA)

STEP 4A: Create Histograms

Step 4B: Creating Boxplots

Step 4C: Create Pair Plots

Note: To learn more about Pair Plots, see https://seaborn.pydata.org/generated/seaborn.pairplot.html

Step 4D: Creating Violin Plots

Note: To learn more about Violin Plots, see https://seaborn.pydata.org/generated/seaborn.violinplot.html

WORKFLOW: DATA SPLIT

STEP 5: Separate the Dataset into Input & Output NumPy Arrays

STEP 6: Split the Input/Output Arrays into Training/Testing Datasets
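Note: A minimal sketch of Steps 5 and 6 for the Iris data, assuming scikit-learn's bundled copy as a stand-in for iris.csv. The 33% test size, the seed, and the stratified split are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X holds the 4 numeric predictors; Y holds the class label (0, 1, or 2).
X, Y = load_iris(return_X_y=True)

# stratify=Y keeps the class proportions the same in both portions.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7, stratify=Y
)
print(X_train.shape, X_test.shape)
```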

WORKFLOW: TRAIN MODEL

STEP 7: Build and Train the Model
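Note: A minimal sketch of building and training a logistic regression classifier on scikit-learn's bundled Iris data. The max_iter value is an assumption, raised from the default to give the solver room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7, stratify=Y
)

# Build the classifier, then train it on the training portion only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# One row of coefficients per class, one column per predictor.
print(model.classes_, model.coef_.shape)
```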

WORKFLOW: SCORE MODEL 1

STEP 8: Score the Accuracy of the Model
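Note: Accuracy is the fraction of test samples whose predicted class matches the true class. The sketch below, on scikit-learn's bundled Iris data, computes it with accuracy_score and checks that the classifier's own score method reports the same number.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7, stratify=Y
)
model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

# Fraction of correct predictions on the held-out test set.
accuracy = accuracy_score(Y_test, model.predict(X_test))
print(f"Accuracy: {accuracy * 100:.1f}%")
```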

Step 9: Prediction

Note: We have now trained the model and are using the trained model to predict the type of flower from the listed values of each variable.
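Note: A minimal sketch of predicting the flower type for one new set of measurements. The measurements in new_flower are a hypothetical example (sepal length, sepal width, petal length, petal width, in cm), and the model is trained on scikit-learn's bundled Iris data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# predict() expects a 2D array: one row per flower.
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
predicted_index = model.predict(new_flower)[0]

# Map the numeric class label back to the species name.
predicted_name = iris.target_names[predicted_index]
print(predicted_name)
```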

WORKFLOW: EVALUATE MODELS

Step 10: Train & Score Model 2 Using K-Fold Cross Validation Data Split
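Note: A minimal sketch of K-fold cross-validation for the classifier, using accuracy as the score. The fold count and seed are assumptions. Shuffling matters here because the bundled Iris rows are ordered by class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Pass the whole dataset; KFold handles the splitting.
X, Y = load_iris(return_X_y=True)

# shuffle=True mixes the classes across folds before splitting.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
acc_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, Y, cv=kfold, scoring="accuracy"
)

# Mean and spread of the accuracy across the 10 folds.
print(f"{acc_scores.mean():.3f} (+/- {acc_scores.std():.3f})")
```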

GREAT JOB! YOU ARE DONE.